{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "# 常见解决方案\n",
    "1. 部分数据缺失，存在大量的空值。\n",
    "  a. 填充或者删除缺失数据\n",
    "  b. 检测缺失数据isnull，删除数据dropna，填充数据fillna\n",
    "2. 数据存在重复值\n",
    "  a. 检测重复值duplicated\n",
    "  b. 删除重复值drop_duplicates\n",
    "3. 数据类型不统一\n",
    "  a. 数据类型转换\n",
    "4. 部分数据包含数值和字符串\n",
    "  a. 字符串操作"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false,
    "execution": {
     "iopub.execute_input": "2024-11-29T10:09:15.181117Z",
     "iopub.status.busy": "2024-11-29T10:09:15.179947Z",
     "iopub.status.idle": "2024-11-29T10:09:16.272775Z",
     "shell.execute_reply": "2024-11-29T10:09:16.272775Z",
     "shell.execute_reply.started": "2024-11-29T10:09:15.181117Z"
    },
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "# 处理缺失数据\n",
    "|方法\t|说明|\n",
    "|---|---|\n",
    "|dropna\t|根据各标签的值中是否存在缺失数据对轴标签进行过滤，可通过阈值调节对缺失值的容忍度|\n",
    "|fillna\t|用指定值或插值方法（如ffilI或bfill）填充缺失数据|\n",
    "|isnull\t|返回一个含有布尔值的对象，这些布尔值表示哪些值是缺失值/NA，该对象的类型与源类型一样|\n",
    "|notnull|\tisnull的否定式|"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "## 找到缺失数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false,
    "execution": {
     "iopub.execute_input": "2024-11-29T09:28:46.426668Z",
     "iopub.status.busy": "2024-11-29T09:28:46.426668Z",
     "iopub.status.idle": "2024-11-29T09:28:46.456703Z",
     "shell.execute_reply": "2024-11-29T09:28:46.455346Z",
     "shell.execute_reply.started": "2024-11-29T09:28:46.426668Z"
    },
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0      a\n",
       "1      b\n",
       "2    NaN\n",
       "3      d\n",
       "dtype: object"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data1 = pd.Series(['a','b',np.nan,'d'])\n",
    "data1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false,
    "execution": {
     "iopub.execute_input": "2024-11-29T09:28:46.457787Z",
     "iopub.status.busy": "2024-11-29T09:28:46.456703Z",
     "iopub.status.idle": "2024-11-29T09:28:46.470532Z",
     "shell.execute_reply": "2024-11-29T09:28:46.470532Z",
     "shell.execute_reply.started": "2024-11-29T09:28:46.457787Z"
    },
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0     True\n",
       "1    False\n",
       "2     True\n",
       "3    False\n",
       "dtype: bool"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data1[0] = None\n",
    "data1.isnull()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "## 滤除缺失数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false,
    "execution": {
     "iopub.execute_input": "2024-11-29T09:28:46.471739Z",
     "iopub.status.busy": "2024-11-29T09:28:46.471739Z",
     "iopub.status.idle": "2024-11-29T09:28:46.487540Z",
     "shell.execute_reply": "2024-11-29T09:28:46.486533Z",
     "shell.execute_reply.started": "2024-11-29T09:28:46.471739Z"
    },
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1    b\n",
       "3    d\n",
       "dtype: object"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 行中有一个列为nan时，就丢弃整个行\n",
    "data1.dropna()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false,
    "execution": {
     "iopub.execute_input": "2024-11-29T09:28:46.488535Z",
     "iopub.status.busy": "2024-11-29T09:28:46.488535Z",
     "iopub.status.idle": "2024-11-29T09:28:46.502486Z",
     "shell.execute_reply": "2024-11-29T09:28:46.501353Z",
     "shell.execute_reply.started": "2024-11-29T09:28:46.488535Z"
    },
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1    b\n",
       "3    d\n",
       "dtype: object"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data1[data1.notnull()]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false,
    "execution": {
     "iopub.execute_input": "2024-11-29T09:28:46.503489Z",
     "iopub.status.busy": "2024-11-29T09:28:46.502486Z",
     "iopub.status.idle": "2024-11-29T09:28:46.517450Z",
     "shell.execute_reply": "2024-11-29T09:28:46.516945Z",
     "shell.execute_reply.started": "2024-11-29T09:28:46.503489Z"
    },
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "      <td>6.5</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>NaN</td>\n",
       "      <td>6.7</td>\n",
       "      <td>7.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     0    1    2\n",
       "0  1.0  6.5  3.0\n",
       "1  1.0  NaN  NaN\n",
       "2  NaN  NaN  NaN\n",
       "3  NaN  6.7  7.0"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data2 = pd.DataFrame([[1.,6.5,3.],[1.,np.nan,np.nan],[np.nan,np.nan,np.nan],[np.nan,6.7,7.]])\n",
    "data2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false,
    "execution": {
     "iopub.execute_input": "2024-11-29T09:28:46.518828Z",
     "iopub.status.busy": "2024-11-29T09:28:46.517450Z",
     "iopub.status.idle": "2024-11-29T09:28:46.533746Z",
     "shell.execute_reply": "2024-11-29T09:28:46.532565Z",
     "shell.execute_reply.started": "2024-11-29T09:28:46.518828Z"
    },
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "      <td>6.5</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     0    1    2\n",
       "0  1.0  6.5  3.0"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data2.dropna()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false,
    "execution": {
     "iopub.execute_input": "2024-11-29T09:28:46.533746Z",
     "iopub.status.busy": "2024-11-29T09:28:46.533746Z",
     "iopub.status.idle": "2024-11-29T09:28:46.549208Z",
     "shell.execute_reply": "2024-11-29T09:28:46.548207Z",
     "shell.execute_reply.started": "2024-11-29T09:28:46.533746Z"
    },
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "      <td>6.5</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>NaN</td>\n",
       "      <td>6.7</td>\n",
       "      <td>7.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     0    1    2\n",
       "0  1.0  6.5  3.0\n",
       "1  1.0  NaN  NaN\n",
       "3  NaN  6.7  7.0"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 丢弃全为nan的行\n",
    "data2.dropna(how='all')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false,
    "execution": {
     "iopub.execute_input": "2024-11-29T09:28:46.551213Z",
     "iopub.status.busy": "2024-11-29T09:28:46.550210Z",
     "iopub.status.idle": "2024-11-29T09:28:46.564414Z",
     "shell.execute_reply": "2024-11-29T09:28:46.564414Z",
     "shell.execute_reply.started": "2024-11-29T09:28:46.551213Z"
    },
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "      <td>6.5</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>NaN</td>\n",
       "      <td>6.7</td>\n",
       "      <td>7.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     0    1    2\n",
       "0  1.0  6.5  3.0\n",
       "1  1.0  NaN  NaN\n",
       "2  NaN  NaN  NaN\n",
       "3  NaN  6.7  7.0"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 丢弃全为nan的列，结果没有任何一列被丢弃\n",
    "data2.dropna(axis=1,how='all')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false,
    "execution": {
     "iopub.execute_input": "2024-11-29T09:28:46.566414Z",
     "iopub.status.busy": "2024-11-29T09:28:46.566414Z",
     "iopub.status.idle": "2024-11-29T09:28:46.581054Z",
     "shell.execute_reply": "2024-11-29T09:28:46.579823Z",
     "shell.execute_reply.started": "2024-11-29T09:28:46.566414Z"
    },
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>-0.459354</td>\n",
       "      <td>-0.044522</td>\n",
       "      <td>-0.229352</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>-0.089595</td>\n",
       "      <td>-0.393446</td>\n",
       "      <td>-0.644506</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.376547</td>\n",
       "      <td>1.303473</td>\n",
       "      <td>0.568497</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.080471</td>\n",
       "      <td>0.335174</td>\n",
       "      <td>0.584950</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.448781</td>\n",
       "      <td>-0.649418</td>\n",
       "      <td>-0.038726</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>0.758286</td>\n",
       "      <td>0.429802</td>\n",
       "      <td>0.893894</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>-0.425907</td>\n",
       "      <td>1.562327</td>\n",
       "      <td>3.328144</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          0         1         2\n",
       "0 -0.459354 -0.044522 -0.229352\n",
       "1 -0.089595 -0.393446 -0.644506\n",
       "2  0.376547  1.303473  0.568497\n",
       "3  0.080471  0.335174  0.584950\n",
       "4  0.448781 -0.649418 -0.038726\n",
       "5  0.758286  0.429802  0.893894\n",
       "6 -0.425907  1.562327  3.328144"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.DataFrame(np.random.randn(7,3))\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false,
    "execution": {
     "iopub.execute_input": "2024-11-29T09:28:46.582339Z",
     "iopub.status.busy": "2024-11-29T09:28:46.581054Z",
     "iopub.status.idle": "2024-11-29T09:28:46.596354Z",
     "shell.execute_reply": "2024-11-29T09:28:46.595849Z",
     "shell.execute_reply.started": "2024-11-29T09:28:46.582339Z"
    },
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>-0.459354</td>\n",
       "      <td>NaN</td>\n",
       "      <td>-0.229352</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>-0.089595</td>\n",
       "      <td>NaN</td>\n",
       "      <td>-0.644506</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.376547</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.568497</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.080471</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.584950</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.448781</td>\n",
       "      <td>-0.649418</td>\n",
       "      <td>-0.038726</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>0.758286</td>\n",
       "      <td>0.429802</td>\n",
       "      <td>0.893894</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>-0.425907</td>\n",
       "      <td>1.562327</td>\n",
       "      <td>3.328144</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          0         1         2\n",
       "0 -0.459354       NaN -0.229352\n",
       "1 -0.089595       NaN -0.644506\n",
       "2  0.376547       NaN  0.568497\n",
       "3  0.080471       NaN  0.584950\n",
       "4  0.448781 -0.649418 -0.038726\n",
       "5  0.758286  0.429802  0.893894\n",
       "6 -0.425907  1.562327  3.328144"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.iloc[:4,1] = np.nan\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false,
    "execution": {
     "iopub.execute_input": "2024-11-29T09:28:46.597855Z",
     "iopub.status.busy": "2024-11-29T09:28:46.597855Z",
     "iopub.status.idle": "2024-11-29T09:28:46.612487Z",
     "shell.execute_reply": "2024-11-29T09:28:46.611932Z",
     "shell.execute_reply.started": "2024-11-29T09:28:46.597855Z"
    },
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>-0.459354</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>-0.089595</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.376547</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.568497</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.080471</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.584950</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.448781</td>\n",
       "      <td>-0.649418</td>\n",
       "      <td>-0.038726</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>0.758286</td>\n",
       "      <td>0.429802</td>\n",
       "      <td>0.893894</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>-0.425907</td>\n",
       "      <td>1.562327</td>\n",
       "      <td>3.328144</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          0         1         2\n",
       "0 -0.459354       NaN       NaN\n",
       "1 -0.089595       NaN       NaN\n",
       "2  0.376547       NaN  0.568497\n",
       "3  0.080471       NaN  0.584950\n",
       "4  0.448781 -0.649418 -0.038726\n",
       "5  0.758286  0.429802  0.893894\n",
       "6 -0.425907  1.562327  3.328144"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.iloc[:2,2] = np.nan\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false,
    "execution": {
     "iopub.execute_input": "2024-11-29T09:28:46.613685Z",
     "iopub.status.busy": "2024-11-29T09:28:46.613685Z",
     "iopub.status.idle": "2024-11-29T09:28:46.628706Z",
     "shell.execute_reply": "2024-11-29T09:28:46.627562Z",
     "shell.execute_reply.started": "2024-11-29T09:28:46.613685Z"
    },
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.448781</td>\n",
       "      <td>-0.649418</td>\n",
       "      <td>-0.038726</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>0.758286</td>\n",
       "      <td>0.429802</td>\n",
       "      <td>0.893894</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>-0.425907</td>\n",
       "      <td>1.562327</td>\n",
       "      <td>3.328144</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          0         1         2\n",
       "4  0.448781 -0.649418 -0.038726\n",
       "5  0.758286  0.429802  0.893894\n",
       "6 -0.425907  1.562327  3.328144"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.dropna()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false,
    "execution": {
     "iopub.execute_input": "2024-11-29T09:28:46.629881Z",
     "iopub.status.busy": "2024-11-29T09:28:46.629881Z",
     "iopub.status.idle": "2024-11-29T09:28:46.642704Z",
     "shell.execute_reply": "2024-11-29T09:28:46.642704Z",
     "shell.execute_reply.started": "2024-11-29T09:28:46.629881Z"
    },
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.376547</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.568497</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.080471</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.584950</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.448781</td>\n",
       "      <td>-0.649418</td>\n",
       "      <td>-0.038726</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>0.758286</td>\n",
       "      <td>0.429802</td>\n",
       "      <td>0.893894</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>-0.425907</td>\n",
       "      <td>1.562327</td>\n",
       "      <td>3.328144</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          0         1         2\n",
       "2  0.376547       NaN  0.568497\n",
       "3  0.080471       NaN  0.584950\n",
       "4  0.448781 -0.649418 -0.038726\n",
       "5  0.758286  0.429802  0.893894\n",
       "6 -0.425907  1.562327  3.328144"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# thresh=2 保留含有2个及以上的非空值的行\n",
    "df.dropna(thresh=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "## 填充数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false,
    "execution": {
     "iopub.execute_input": "2024-11-29T09:28:46.644098Z",
     "iopub.status.busy": "2024-11-29T09:28:46.644098Z",
     "iopub.status.idle": "2024-11-29T09:28:46.659241Z",
     "shell.execute_reply": "2024-11-29T09:28:46.657963Z",
     "shell.execute_reply.started": "2024-11-29T09:28:46.644098Z"
    },
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>-0.459354</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>-0.089595</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.376547</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.568497</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.080471</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.584950</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.448781</td>\n",
       "      <td>-0.649418</td>\n",
       "      <td>-0.038726</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>0.758286</td>\n",
       "      <td>0.429802</td>\n",
       "      <td>0.893894</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>-0.425907</td>\n",
       "      <td>1.562327</td>\n",
       "      <td>3.328144</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          0         1         2\n",
       "0 -0.459354  0.000000  0.000000\n",
       "1 -0.089595  0.000000  0.000000\n",
       "2  0.376547  0.000000  0.568497\n",
       "3  0.080471  0.000000  0.584950\n",
       "4  0.448781 -0.649418 -0.038726\n",
       "5  0.758286  0.429802  0.893894\n",
       "6 -0.425907  1.562327  3.328144"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.fillna(0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false,
    "execution": {
     "iopub.execute_input": "2024-11-29T09:28:46.659241Z",
     "iopub.status.busy": "2024-11-29T09:28:46.659241Z",
     "iopub.status.idle": "2024-11-29T09:28:46.674257Z",
     "shell.execute_reply": "2024-11-29T09:28:46.674257Z",
     "shell.execute_reply.started": "2024-11-29T09:28:46.659241Z"
    },
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>-0.459354</td>\n",
       "      <td>0.900000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>-0.089595</td>\n",
       "      <td>0.900000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.376547</td>\n",
       "      <td>0.900000</td>\n",
       "      <td>0.568497</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.080471</td>\n",
       "      <td>0.900000</td>\n",
       "      <td>0.584950</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.448781</td>\n",
       "      <td>-0.649418</td>\n",
       "      <td>-0.038726</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>0.758286</td>\n",
       "      <td>0.429802</td>\n",
       "      <td>0.893894</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>-0.425907</td>\n",
       "      <td>1.562327</td>\n",
       "      <td>3.328144</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          0         1         2\n",
       "0 -0.459354  0.900000  0.000000\n",
       "1 -0.089595  0.900000  0.000000\n",
       "2  0.376547  0.900000  0.568497\n",
       "3  0.080471  0.900000  0.584950\n",
       "4  0.448781 -0.649418 -0.038726\n",
       "5  0.758286  0.429802  0.893894\n",
       "6 -0.425907  1.562327  3.328144"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 将列索引为1的na填充为0.9,索引为2的列填充为0.0\n",
    "df.fillna({1:0.9,2:0},inplace=True)\n",
    "df#inplace参数 就地修改,即直接修改原数组"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": false,
    "execution": {
     "iopub.execute_input": "2024-11-29T09:28:46.676367Z",
     "iopub.status.busy": "2024-11-29T09:28:46.675864Z",
     "iopub.status.idle": "2024-11-29T09:28:46.689858Z",
     "shell.execute_reply": "2024-11-29T09:28:46.689858Z",
     "shell.execute_reply.started": "2024-11-29T09:28:46.676367Z"
    },
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>-0.083642</td>\n",
       "      <td>-0.535752</td>\n",
       "      <td>0.815793</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>-0.543724</td>\n",
       "      <td>-1.500269</td>\n",
       "      <td>0.665208</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>-0.178611</td>\n",
       "      <td>0.277309</td>\n",
       "      <td>-0.103791</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>-1.009735</td>\n",
       "      <td>-0.190324</td>\n",
       "      <td>0.094875</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.309928</td>\n",
       "      <td>-0.839473</td>\n",
       "      <td>0.354909</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>0.047153</td>\n",
       "      <td>1.808260</td>\n",
       "      <td>0.332167</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          0         1         2\n",
       "0 -0.083642 -0.535752  0.815793\n",
       "1 -0.543724 -1.500269  0.665208\n",
       "2 -0.178611  0.277309 -0.103791\n",
       "3 -1.009735 -0.190324  0.094875\n",
       "4  0.309928 -0.839473  0.354909\n",
       "5  0.047153  1.808260  0.332167"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df2 = pd.DataFrame(np.random.randn(6,3))\n",
    "df2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": false,
    "execution": {
     "iopub.execute_input": "2024-11-29T09:28:46.692187Z",
     "iopub.status.busy": "2024-11-29T09:28:46.691187Z",
     "iopub.status.idle": "2024-11-29T09:28:46.706595Z",
     "shell.execute_reply": "2024-11-29T09:28:46.705589Z",
     "shell.execute_reply.started": "2024-11-29T09:28:46.692187Z"
    },
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>-0.083642</td>\n",
       "      <td>-0.535752</td>\n",
       "      <td>0.815793</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>-0.543724</td>\n",
       "      <td>-1.500269</td>\n",
       "      <td>0.665208</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>-0.178611</td>\n",
       "      <td>NaN</td>\n",
       "      <td>-0.103791</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>-1.009735</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.094875</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.309928</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>0.047153</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          0         1         2\n",
       "0 -0.083642 -0.535752  0.815793\n",
       "1 -0.543724 -1.500269  0.665208\n",
       "2 -0.178611       NaN -0.103791\n",
       "3 -1.009735       NaN  0.094875\n",
       "4  0.309928       NaN       NaN\n",
       "5  0.047153       NaN       NaN"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df2.iloc[2:,1] = np.nan\n",
    "df2.iloc[4:,2] = np.nan\n",
    "df2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": false,
    "execution": {
     "iopub.execute_input": "2024-11-29T09:28:46.707611Z",
     "iopub.status.busy": "2024-11-29T09:28:46.706595Z",
     "iopub.status.idle": "2024-11-29T09:28:46.720578Z",
     "shell.execute_reply": "2024-11-29T09:28:46.720578Z",
     "shell.execute_reply.started": "2024-11-29T09:28:46.707611Z"
    },
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "C:\\Users\\98614\\AppData\\Local\\Temp\\ipykernel_21444\\1937335893.py:2: FutureWarning: DataFrame.fillna with 'method' is deprecated and will raise in a future version. Use obj.ffill() or obj.bfill() instead.\n",
      "  df2.fillna(method='ffill')\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>-0.083642</td>\n",
       "      <td>-0.535752</td>\n",
       "      <td>0.815793</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>-0.543724</td>\n",
       "      <td>-1.500269</td>\n",
       "      <td>0.665208</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>-0.178611</td>\n",
       "      <td>-1.500269</td>\n",
       "      <td>-0.103791</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>-1.009735</td>\n",
       "      <td>-1.500269</td>\n",
       "      <td>0.094875</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.309928</td>\n",
       "      <td>-1.500269</td>\n",
       "      <td>0.094875</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>0.047153</td>\n",
       "      <td>-1.500269</td>\n",
       "      <td>0.094875</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          0         1         2\n",
       "0 -0.083642 -0.535752  0.815793\n",
       "1 -0.543724 -1.500269  0.665208\n",
       "2 -0.178611 -1.500269 -0.103791\n",
       "3 -1.009735 -1.500269  0.094875\n",
       "4  0.309928 -1.500269  0.094875\n",
       "5  0.047153 -1.500269  0.094875"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# method='ffill' 列的维度，为nan的值会复制上一行非nan的值\n",
    "df2.fillna(method='ffill')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": false,
    "execution": {
     "iopub.execute_input": "2024-11-29T09:29:00.132284Z",
     "iopub.status.busy": "2024-11-29T09:29:00.132284Z",
     "iopub.status.idle": "2024-11-29T09:29:00.151210Z",
     "shell.execute_reply": "2024-11-29T09:29:00.151210Z",
     "shell.execute_reply.started": "2024-11-29T09:29:00.132284Z"
    },
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "C:\\Users\\98614\\AppData\\Local\\Temp\\ipykernel_21444\\2557755900.py:2: FutureWarning: DataFrame.fillna with 'method' is deprecated and will raise in a future version. Use obj.ffill() or obj.bfill() instead.\n",
      "  df2.fillna(method='ffill',limit=2)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>-0.083642</td>\n",
       "      <td>-0.535752</td>\n",
       "      <td>0.815793</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>-0.543724</td>\n",
       "      <td>-1.500269</td>\n",
       "      <td>0.665208</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>-0.178611</td>\n",
       "      <td>-1.500269</td>\n",
       "      <td>-0.103791</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>-1.009735</td>\n",
       "      <td>-1.500269</td>\n",
       "      <td>0.094875</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.309928</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.094875</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>0.047153</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.094875</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          0         1         2\n",
       "0 -0.083642 -0.535752  0.815793\n",
       "1 -0.543724 -1.500269  0.665208\n",
       "2 -0.178611 -1.500269 -0.103791\n",
       "3 -1.009735 -1.500269  0.094875\n",
       "4  0.309928       NaN  0.094875\n",
       "5  0.047153       NaN  0.094875"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# method='ffill' 列的维度，为nan的值会复制上一行非nan的值，这种操作要填充2个\n",
    "df2.fillna(method='ffill',limit=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "# 数据过滤filter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-11-29T09:29:33.693483Z",
     "iopub.status.busy": "2024-11-29T09:29:33.693483Z",
     "iopub.status.idle": "2024-11-29T09:29:33.704809Z",
     "shell.execute_reply": "2024-11-29T09:29:33.703591Z",
     "shell.execute_reply.started": "2024-11-29T09:29:33.693483Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>color</th>\n",
       "      <th>price</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>red</td>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>blue</td>\n",
       "      <td>15</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>red</td>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>green</td>\n",
       "      <td>18</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>green</td>\n",
       "      <td>18</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>blue</td>\n",
       "      <td>22</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>None</td>\n",
       "      <td>30</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>NaN</td>\n",
       "      <td>30</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>green</td>\n",
       "      <td>22</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   color  price\n",
       "0    red     20\n",
       "1   blue     15\n",
       "2    red     20\n",
       "3  green     18\n",
       "4  green     18\n",
       "5   blue     22\n",
       "6   None     30\n",
       "7    NaN     30\n",
       "8  green     22"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.DataFrame(\n",
    "    data={\n",
    "        \"color\": [\n",
    "            \"red\",\n",
    "            \"blue\",\n",
    "            \"red\",\n",
    "            \"green\",\n",
    "            \"green\",\n",
    "            \"blue\",\n",
    "            None,\n",
    "            np.nan,\n",
    "            \"green\",\n",
    "        ],\n",
    "        \"price\": [20, 15, 20, 18, 18, 22, 30, 30, 22],\n",
    "    }\n",
    ")\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-11-29T09:29:51.993990Z",
     "iopub.status.busy": "2024-11-29T09:29:51.993990Z",
     "iopub.status.idle": "2024-11-29T09:29:52.006558Z",
     "shell.execute_reply": "2024-11-29T09:29:52.006558Z",
     "shell.execute_reply.started": "2024-11-29T09:29:51.993990Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>price</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>15</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>18</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>18</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>22</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>30</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>30</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>22</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   price\n",
       "0     20\n",
       "1     15\n",
       "2     20\n",
       "3     18\n",
       "4     18\n",
       "5     22\n",
       "6     30\n",
       "7     30\n",
       "8     22"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.filter(items=[\"price\"])  # 参数意思，保留数据price"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-11-29T09:30:01.942204Z",
     "iopub.status.busy": "2024-11-29T09:30:01.942204Z",
     "iopub.status.idle": "2024-11-29T09:30:01.962687Z",
     "shell.execute_reply": "2024-11-29T09:30:01.961687Z",
     "shell.execute_reply.started": "2024-11-29T09:30:01.942204Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>color</th>\n",
       "      <th>price</th>\n",
       "      <th>size</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>red</td>\n",
       "      <td>20</td>\n",
       "      <td>1024</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>blue</td>\n",
       "      <td>15</td>\n",
       "      <td>1024</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>red</td>\n",
       "      <td>20</td>\n",
       "      <td>1024</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>green</td>\n",
       "      <td>18</td>\n",
       "      <td>1024</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>green</td>\n",
       "      <td>18</td>\n",
       "      <td>1024</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>blue</td>\n",
       "      <td>22</td>\n",
       "      <td>1024</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>None</td>\n",
       "      <td>30</td>\n",
       "      <td>1024</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>NaN</td>\n",
       "      <td>30</td>\n",
       "      <td>1024</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>green</td>\n",
       "      <td>22</td>\n",
       "      <td>1024</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   color  price  size\n",
       "0    red     20  1024\n",
       "1   blue     15  1024\n",
       "2    red     20  1024\n",
       "3  green     18  1024\n",
       "4  green     18  1024\n",
       "5   blue     22  1024\n",
       "6   None     30  1024\n",
       "7    NaN     30  1024\n",
       "8  green     22  1024"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[\"size\"] = 1024  # 广播\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-11-29T09:30:11.595631Z",
     "iopub.status.busy": "2024-11-29T09:30:11.595631Z",
     "iopub.status.idle": "2024-11-29T09:30:11.611928Z",
     "shell.execute_reply": "2024-11-29T09:30:11.611423Z",
     "shell.execute_reply.started": "2024-11-29T09:30:11.595631Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>price</th>\n",
       "      <th>size</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>20</td>\n",
       "      <td>1024</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>15</td>\n",
       "      <td>1024</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>20</td>\n",
       "      <td>1024</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>18</td>\n",
       "      <td>1024</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>18</td>\n",
       "      <td>1024</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>22</td>\n",
       "      <td>1024</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>30</td>\n",
       "      <td>1024</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>30</td>\n",
       "      <td>1024</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>22</td>\n",
       "      <td>1024</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   price  size\n",
       "0     20  1024\n",
       "1     15  1024\n",
       "2     20  1024\n",
       "3     18  1024\n",
       "4     18  1024\n",
       "5     22  1024\n",
       "6     30  1024\n",
       "7     30  1024\n",
       "8     22  1024"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.filter(like=\"i\")  # 模糊匹配，保留了带有i这个字母的索引"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-11-29T09:30:21.049423Z",
     "iopub.status.busy": "2024-11-29T09:30:21.049423Z",
     "iopub.status.idle": "2024-11-29T09:30:21.058407Z",
     "shell.execute_reply": "2024-11-29T09:30:21.057902Z",
     "shell.execute_reply.started": "2024-11-29T09:30:21.049423Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>color</th>\n",
       "      <th>price</th>\n",
       "      <th>size</th>\n",
       "      <th>hello</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>red</td>\n",
       "      <td>20</td>\n",
       "      <td>1024</td>\n",
       "      <td>512</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>blue</td>\n",
       "      <td>15</td>\n",
       "      <td>1024</td>\n",
       "      <td>512</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>red</td>\n",
       "      <td>20</td>\n",
       "      <td>1024</td>\n",
       "      <td>512</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>green</td>\n",
       "      <td>18</td>\n",
       "      <td>1024</td>\n",
       "      <td>512</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>green</td>\n",
       "      <td>18</td>\n",
       "      <td>1024</td>\n",
       "      <td>512</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>blue</td>\n",
       "      <td>22</td>\n",
       "      <td>1024</td>\n",
       "      <td>512</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>None</td>\n",
       "      <td>30</td>\n",
       "      <td>1024</td>\n",
       "      <td>512</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>NaN</td>\n",
       "      <td>30</td>\n",
       "      <td>1024</td>\n",
       "      <td>512</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>green</td>\n",
       "      <td>22</td>\n",
       "      <td>1024</td>\n",
       "      <td>512</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   color  price  size  hello\n",
       "0    red     20  1024    512\n",
       "1   blue     15  1024    512\n",
       "2    red     20  1024    512\n",
       "3  green     18  1024    512\n",
       "4  green     18  1024    512\n",
       "5   blue     22  1024    512\n",
       "6   None     30  1024    512\n",
       "7    NaN     30  1024    512\n",
       "8  green     22  1024    512"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[\"hello\"] = 512\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-11-29T09:30:29.906687Z",
     "iopub.status.busy": "2024-11-29T09:30:29.906687Z",
     "iopub.status.idle": "2024-11-29T09:30:29.921064Z",
     "shell.execute_reply": "2024-11-29T09:30:29.919862Z",
     "shell.execute_reply.started": "2024-11-29T09:30:29.906687Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>price</th>\n",
       "      <th>size</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>20</td>\n",
       "      <td>1024</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>15</td>\n",
       "      <td>1024</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>20</td>\n",
       "      <td>1024</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>18</td>\n",
       "      <td>1024</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>18</td>\n",
       "      <td>1024</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>22</td>\n",
       "      <td>1024</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>30</td>\n",
       "      <td>1024</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>30</td>\n",
       "      <td>1024</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>22</td>\n",
       "      <td>1024</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   price  size\n",
       "0     20  1024\n",
       "1     15  1024\n",
       "2     20  1024\n",
       "3     18  1024\n",
       "4     18  1024\n",
       "5     22  1024\n",
       "6     30  1024\n",
       "7     30  1024\n",
       "8     22  1024"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 正则表达式，方式很多\n",
    "df.filter(regex=\"e$\")  # 正则表达式，正则表达式，限制e必须在最后"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-11-29T09:30:37.186537Z",
     "iopub.status.busy": "2024-11-29T09:30:37.186537Z",
     "iopub.status.idle": "2024-11-29T09:30:37.209520Z",
     "shell.execute_reply": "2024-11-29T09:30:37.208434Z",
     "shell.execute_reply.started": "2024-11-29T09:30:37.186537Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>price</th>\n",
       "      <th>size</th>\n",
       "      <th>hello</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>20</td>\n",
       "      <td>1024</td>\n",
       "      <td>512</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>15</td>\n",
       "      <td>1024</td>\n",
       "      <td>512</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>20</td>\n",
       "      <td>1024</td>\n",
       "      <td>512</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>18</td>\n",
       "      <td>1024</td>\n",
       "      <td>512</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>18</td>\n",
       "      <td>1024</td>\n",
       "      <td>512</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>22</td>\n",
       "      <td>1024</td>\n",
       "      <td>512</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>30</td>\n",
       "      <td>1024</td>\n",
       "      <td>512</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>30</td>\n",
       "      <td>1024</td>\n",
       "      <td>512</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>22</td>\n",
       "      <td>1024</td>\n",
       "      <td>512</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   price  size  hello\n",
       "0     20  1024    512\n",
       "1     15  1024    512\n",
       "2     20  1024    512\n",
       "3     18  1024    512\n",
       "4     18  1024    512\n",
       "5     22  1024    512\n",
       "6     30  1024    512\n",
       "7     30  1024    512\n",
       "8     22  1024    512"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.filter(regex=\"e\")  # 只要带有e全部选出来"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "# 数据转换\n",
    "## 移除重复数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>k1</th>\n",
       "      <th>k2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>one</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>two</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>one</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>two</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>one</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>two</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>two</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    k1  k2\n",
       "0  one   1\n",
       "1  two   1\n",
       "2  one   2\n",
       "3  two   3\n",
       "4  one   3\n",
       "5  two   4\n",
       "6  two   4"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data = pd.DataFrame({'k1':['one','two']*3+['two'],\n",
    "                     'k2':[1,1,2,3,3,4,4]})\n",
    "data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    False\n",
       "1    False\n",
       "2    False\n",
       "3    False\n",
       "4    False\n",
       "5    False\n",
       "6     True\n",
       "dtype: bool"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# duplicated()返回布尔型Series表示每行是否为重复行\n",
    "# 检查是否重复\n",
    "data.duplicated()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>k1</th>\n",
       "      <th>k2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>one</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>two</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>one</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>two</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>one</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>two</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    k1  k2\n",
       "0  one   1\n",
       "1  two   1\n",
       "2  one   2\n",
       "3  two   3\n",
       "4  one   3\n",
       "5  two   4"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# drop_duplicates()过滤重复行  1、默认判断全部列  2、可指定按某些列判断\n",
    "# 删除重复数据\n",
    "data.drop_duplicates()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>k1</th>\n",
       "      <th>k2</th>\n",
       "      <th>v1</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>one</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>two</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>one</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>two</td>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>one</td>\n",
       "      <td>3</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>two</td>\n",
       "      <td>4</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>two</td>\n",
       "      <td>4</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    k1  k2  v1\n",
       "0  one   1   0\n",
       "1  two   1   1\n",
       "2  one   2   2\n",
       "3  two   3   3\n",
       "4  one   3   4\n",
       "5  two   4   5\n",
       "6  two   4   6"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data['v1'] = range(7)\n",
    "data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>k1</th>\n",
       "      <th>k2</th>\n",
       "      <th>v1</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>one</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>two</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    k1  k2  v1\n",
       "0  one   1   0\n",
       "1  two   1   1"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.drop_duplicates(['k1'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>k1</th>\n",
       "      <th>k2</th>\n",
       "      <th>v1</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>two</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>one</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>one</td>\n",
       "      <td>3</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>two</td>\n",
       "      <td>4</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    k1  k2  v1\n",
       "1  two   1   1\n",
       "2  one   2   2\n",
       "4  one   3   4\n",
       "6  two   4   6"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# keep='last'，保留k2列重复值，的最后一行的那个值\n",
    "data.drop_duplicates(['k2'],keep='last')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>k1</th>\n",
       "      <th>k2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>one</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>two</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>one</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>two</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>one</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>two</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>two</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    k1  k2\n",
       "0  one   1\n",
       "1  two   1\n",
       "2  one   2\n",
       "3  two   3\n",
       "4  one   3\n",
       "5  two   4\n",
       "6  two   4"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data = pd.DataFrame({'k1':['one','two']*3+['two'],\n",
    "                     'k2':[1,1,2,3,3,4,4]})\n",
    "data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>k1</th>\n",
       "      <th>k2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>one</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>two</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>one</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>two</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>one</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>two</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    k1  k2\n",
       "0  one   1\n",
       "1  two   1\n",
       "2  one   2\n",
       "3  two   3\n",
       "4  one   3\n",
       "6  two   4"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.drop_duplicates(keep='last')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "## map映射元素转变"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>food</th>\n",
       "      <th>price</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Apple</td>\n",
       "      <td>4.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>banana</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>orange</td>\n",
       "      <td>3.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>apple</td>\n",
       "      <td>6.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Mango</td>\n",
       "      <td>12.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>tomato</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     food  price\n",
       "0   Apple    4.0\n",
       "1  banana    3.0\n",
       "2  orange    3.5\n",
       "3   apple    6.0\n",
       "4   Mango   12.0\n",
       "5  tomato    3.0"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data = pd.DataFrame({'food': ['Apple', 'banana', 'orange','apple','Mango', 'tomato'],\n",
    "                    'price': [4, 3, 3.5, 6, 12,3]})\n",
    "data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0     apple\n",
       "1    banana\n",
       "2    orange\n",
       "3     apple\n",
       "4     mango\n",
       "5    tomato\n",
       "Name: food, dtype: object"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "meat = {'apple':'fruit',\n",
    "       'banana':'fruit',\n",
    "       'orange':'fruit',\n",
    "       'mango':'fruit',\n",
    "       'tomato':'vagetables'}\n",
    "#值小写\n",
    "low = data['food'].str.lower()\n",
    "low"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>food</th>\n",
       "      <th>price</th>\n",
       "      <th>class</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Apple</td>\n",
       "      <td>4.0</td>\n",
       "      <td>fruit</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>banana</td>\n",
       "      <td>3.0</td>\n",
       "      <td>fruit</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>orange</td>\n",
       "      <td>3.5</td>\n",
       "      <td>fruit</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>apple</td>\n",
       "      <td>6.0</td>\n",
       "      <td>fruit</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Mango</td>\n",
       "      <td>12.0</td>\n",
       "      <td>fruit</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>tomato</td>\n",
       "      <td>3.0</td>\n",
       "      <td>vagetables</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     food  price       class\n",
       "0   Apple    4.0       fruit\n",
       "1  banana    3.0       fruit\n",
       "2  orange    3.5       fruit\n",
       "3   apple    6.0       fruit\n",
       "4   Mango   12.0       fruit\n",
       "5  tomato    3.0  vagetables"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 利用映射进行数据转换\n",
    "data['class'] = low.map(meat)\n",
    "data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>food</th>\n",
       "      <th>price</th>\n",
       "      <th>class</th>\n",
       "      <th>class1</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Apple</td>\n",
       "      <td>4.0</td>\n",
       "      <td>fruit</td>\n",
       "      <td>fruit</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>banana</td>\n",
       "      <td>3.0</td>\n",
       "      <td>fruit</td>\n",
       "      <td>fruit</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>orange</td>\n",
       "      <td>3.5</td>\n",
       "      <td>fruit</td>\n",
       "      <td>fruit</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>apple</td>\n",
       "      <td>6.0</td>\n",
       "      <td>fruit</td>\n",
       "      <td>fruit</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Mango</td>\n",
       "      <td>12.0</td>\n",
       "      <td>fruit</td>\n",
       "      <td>fruit</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>tomato</td>\n",
       "      <td>3.0</td>\n",
       "      <td>vagetables</td>\n",
       "      <td>vagetables</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     food  price       class      class1\n",
       "0   Apple    4.0       fruit       fruit\n",
       "1  banana    3.0       fruit       fruit\n",
       "2  orange    3.5       fruit       fruit\n",
       "3   apple    6.0       fruit       fruit\n",
       "4   Mango   12.0       fruit       fruit\n",
       "5  tomato    3.0  vagetables  vagetables"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 根据map传入的函数对每行或每列进行转换\n",
    "# map 只能针对一列，就是Series，applymap 可以对dataframe 进行操作\n",
    "data['class1'] = data['food'].map(lambda x:meat[x.lower()])\n",
    "data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## apply映射元素转变"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-11-29T09:46:47.050089Z",
     "iopub.status.busy": "2024-11-29T09:46:47.048975Z",
     "iopub.status.idle": "2024-11-29T09:46:47.064927Z",
     "shell.execute_reply": "2024-11-29T09:46:47.064927Z",
     "shell.execute_reply.started": "2024-11-29T09:46:47.050089Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>人工智能</th>\n",
       "      <th>Tensorflow</th>\n",
       "      <th>Keras</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>X</th>\n",
       "      <td>50</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>B</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C</th>\n",
       "      <td>2</td>\n",
       "      <td>4</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>D</th>\n",
       "      <td>2</td>\n",
       "      <td>6</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>E</th>\n",
       "      <td>4</td>\n",
       "      <td>4</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>F</th>\n",
       "      <td>0</td>\n",
       "      <td>50</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>H</th>\n",
       "      <td>50</td>\n",
       "      <td>0</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>I</th>\n",
       "      <td>9</td>\n",
       "      <td>4</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>J</th>\n",
       "      <td>2</td>\n",
       "      <td>50</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Y</th>\n",
       "      <td>50</td>\n",
       "      <td>7</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   人工智能  Tensorflow  Keras\n",
       "X    50           1      3\n",
       "B     0           0      2\n",
       "C     2           4      4\n",
       "D     2           6      4\n",
       "E     4           4      6\n",
       "F     0          50      4\n",
       "H    50           0      6\n",
       "I     9           4      6\n",
       "J     2          50      2\n",
       "Y    50           7      2"
      ]
     },
     "execution_count": 52,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 既可以操作Series又可以操作DataFrame\n",
    "df = pd.DataFrame(\n",
    "    data=np.random.randint(0, 10, size=(10, 3)),\n",
    "    columns=[\"Python\", \"Tensorflow\", \"Keras\"],\n",
    "    index=list(\"ABCDEFHIJK\"),\n",
    ")\n",
    "df.rename(\n",
    "    index={\"A\": \"X\", \"K\": \"Y\"}, columns={\"Python\": \"人工智能\"}, inplace=True  # 行索引  # 列索引修改   # 替换原数据\n",
    ")  \n",
    "df.replace(5, 50, inplace=True)\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-11-29T09:35:57.566072Z",
     "iopub.status.busy": "2024-11-29T09:35:57.564893Z",
     "iopub.status.idle": "2024-11-29T09:35:57.575114Z",
     "shell.execute_reply": "2024-11-29T09:35:57.574015Z",
     "shell.execute_reply.started": "2024-11-29T09:35:57.566072Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "X    102\n",
       "B    100\n",
       "C    102\n",
       "D    109\n",
       "E    106\n",
       "F    104\n",
       "H    109\n",
       "I    150\n",
       "J    103\n",
       "Y    100\n",
       "Name: 人工智能, dtype: int64"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 既可以操作Series又可以操作DataFrame\n",
    "df[\"人工智能\"].apply(lambda x: x + 100)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-11-29T09:36:44.273211Z",
     "iopub.status.busy": "2024-11-29T09:36:44.273211Z",
     "iopub.status.idle": "2024-11-29T09:36:44.281903Z",
     "shell.execute_reply": "2024-11-29T09:36:44.281903Z",
     "shell.execute_reply.started": "2024-11-29T09:36:44.273211Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>人工智能</th>\n",
       "      <th>Tensorflow</th>\n",
       "      <th>Keras</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>X</th>\n",
       "      <td>1002</td>\n",
       "      <td>1001</td>\n",
       "      <td>1002</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>B</th>\n",
       "      <td>1000</td>\n",
       "      <td>1050</td>\n",
       "      <td>1004</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C</th>\n",
       "      <td>1002</td>\n",
       "      <td>1050</td>\n",
       "      <td>1001</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>D</th>\n",
       "      <td>1009</td>\n",
       "      <td>1009</td>\n",
       "      <td>1008</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>E</th>\n",
       "      <td>1006</td>\n",
       "      <td>1007</td>\n",
       "      <td>1007</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>F</th>\n",
       "      <td>1004</td>\n",
       "      <td>1000</td>\n",
       "      <td>1001</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>H</th>\n",
       "      <td>1009</td>\n",
       "      <td>1003</td>\n",
       "      <td>1008</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>I</th>\n",
       "      <td>1050</td>\n",
       "      <td>1006</td>\n",
       "      <td>1008</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>J</th>\n",
       "      <td>1003</td>\n",
       "      <td>1050</td>\n",
       "      <td>1002</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Y</th>\n",
       "      <td>1000</td>\n",
       "      <td>1006</td>\n",
       "      <td>1002</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   人工智能  Tensorflow  Keras\n",
       "X  1002        1001   1002\n",
       "B  1000        1050   1004\n",
       "C  1002        1050   1001\n",
       "D  1009        1009   1008\n",
       "E  1006        1007   1007\n",
       "F  1004        1000   1001\n",
       "H  1009        1003   1008\n",
       "I  1050        1006   1008\n",
       "J  1003        1050   1002\n",
       "Y  1000        1006   1002"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.apply(lambda x: x + 1000)  # apply对 所有的数据进行映射"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-11-29T09:37:05.389019Z",
     "iopub.status.busy": "2024-11-29T09:37:05.387818Z",
     "iopub.status.idle": "2024-11-29T09:37:05.411095Z",
     "shell.execute_reply": "2024-11-29T09:37:05.410091Z",
     "shell.execute_reply.started": "2024-11-29T09:37:05.389019Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>人工智能</th>\n",
       "      <th>Tensorflow</th>\n",
       "      <th>Keras</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>3.5</td>\n",
       "      <td>6.5</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>10.0</td>\n",
       "      <td>10.0</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>50.0</td>\n",
       "      <td>50.0</td>\n",
       "      <td>8.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>14.9</td>\n",
       "      <td>22.1</td>\n",
       "      <td>3.1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   人工智能  Tensorflow  Keras\n",
       "0   3.5         6.5    3.0\n",
       "1  10.0        10.0   10.0\n",
       "2   0.0         0.0    1.0\n",
       "3  50.0        50.0    8.0\n",
       "4  14.9        22.1    3.1"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def convert(x):\n",
    "    return (x.median(), x.count(), x.min(), x.max(), x.std())  # 返回中位数，返回的是计数,最小值，最大值，标准差\n",
    "\n",
    "df.apply(convert).round(1)  # 默认操作列数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-11-29T09:39:39.718718Z",
     "iopub.status.busy": "2024-11-29T09:39:39.718718Z",
     "iopub.status.idle": "2024-11-29T09:39:39.737746Z",
     "shell.execute_reply": "2024-11-29T09:39:39.736744Z",
     "shell.execute_reply.started": "2024-11-29T09:39:39.718718Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "X     (2.0, 3, 1, 2, 0.5773502691896257)\n",
       "B    (4.0, 3, 0, 50, 27.784887978899608)\n",
       "C     (2.0, 3, 1, 50, 28.00595174839329)\n",
       "D     (9.0, 3, 8, 9, 0.5773502691896257)\n",
       "E     (7.0, 3, 6, 7, 0.5773502691896258)\n",
       "F     (1.0, 3, 0, 4, 2.0816659994661326)\n",
       "H     (8.0, 3, 3, 9, 3.2145502536643185)\n",
       "I      (8.0, 3, 6, 50, 24.8461935381123)\n",
       "J     (3.0, 3, 2, 50, 27.42869543622761)\n",
       "Y      (2.0, 3, 0, 6, 3.055050463303893)\n",
       "dtype: object"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.apply(convert, axis=1) # axis = 1，操作数据就是行数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-11-29T09:46:49.699348Z",
     "iopub.status.busy": "2024-11-29T09:46:49.699348Z",
     "iopub.status.idle": "2024-11-29T09:46:49.720698Z",
     "shell.execute_reply": "2024-11-29T09:46:49.720698Z",
     "shell.execute_reply.started": "2024-11-29T09:46:49.699348Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>人工智能</th>\n",
       "      <th>Tensorflow</th>\n",
       "      <th>Keras</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>X</th>\n",
       "      <td>50</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>B</th>\n",
       "      <td>50</td>\n",
       "      <td>0</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C</th>\n",
       "      <td>52</td>\n",
       "      <td>16</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>D</th>\n",
       "      <td>54</td>\n",
       "      <td>36</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>E</th>\n",
       "      <td>58</td>\n",
       "      <td>16</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>F</th>\n",
       "      <td>58</td>\n",
       "      <td>2500</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>H</th>\n",
       "      <td>108</td>\n",
       "      <td>0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>I</th>\n",
       "      <td>117</td>\n",
       "      <td>16</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>J</th>\n",
       "      <td>119</td>\n",
       "      <td>2500</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Y</th>\n",
       "      <td>169</td>\n",
       "      <td>49</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   人工智能  Tensorflow  Keras\n",
       "X    50           1  False\n",
       "B    50           0  False\n",
       "C    52          16  False\n",
       "D    54          36  False\n",
       "E    58          16   True\n",
       "F    58        2500  False\n",
       "H   108           0   True\n",
       "I   117          16   True\n",
       "J   119        2500  False\n",
       "Y   169          49  False"
      ]
     },
     "execution_count": 53,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def convert(x):\n",
    "    if x > 5:\n",
    "        return True\n",
    "    else:\n",
    "        return False\n",
    "\n",
    "df.apply(\n",
    "    {\"人工智能\": np.cumsum, \"Tensorflow\": np.square, \"Keras\": convert}\n",
    ")  # 对不同的列，执行不同的操作"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## transform元素转变"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-11-29T09:40:09.541532Z",
     "iopub.status.busy": "2024-11-29T09:40:09.540529Z",
     "iopub.status.idle": "2024-11-29T09:40:09.549495Z",
     "shell.execute_reply": "2024-11-29T09:40:09.549495Z",
     "shell.execute_reply.started": "2024-11-29T09:40:09.541532Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Python</th>\n",
       "      <th>Tensorflow</th>\n",
       "      <th>Keras</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>A</th>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>B</th>\n",
       "      <td>2</td>\n",
       "      <td>9</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C</th>\n",
       "      <td>4</td>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>D</th>\n",
       "      <td>3</td>\n",
       "      <td>5</td>\n",
       "      <td>8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>E</th>\n",
       "      <td>9</td>\n",
       "      <td>4</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>F</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>H</th>\n",
       "      <td>3</td>\n",
       "      <td>4</td>\n",
       "      <td>7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>I</th>\n",
       "      <td>2</td>\n",
       "      <td>8</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>J</th>\n",
       "      <td>6</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>K</th>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Python  Tensorflow  Keras\n",
       "A       0           3      3\n",
       "B       2           9      3\n",
       "C       4           4      0\n",
       "D       3           5      8\n",
       "E       9           4      4\n",
       "F       0           1      8\n",
       "H       3           4      7\n",
       "I       2           8      2\n",
       "J       6           2      1\n",
       "K       2           0      0"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "df = pd.DataFrame(\n",
    "    np.random.randint(0, 10, size=(10, 3)),\n",
    "    columns=[\"Python\", \"Tensorflow\", \"Keras\"],\n",
    "    index=list(\"ABCDEFHIJK\"),\n",
    ")\n",
    "display(df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-11-29T09:40:27.019721Z",
     "iopub.status.busy": "2024-11-29T09:40:27.019721Z",
     "iopub.status.idle": "2024-11-29T09:40:27.030559Z",
     "shell.execute_reply": "2024-11-29T09:40:27.030559Z",
     "shell.execute_reply.started": "2024-11-29T09:40:27.019721Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "A   -1024\n",
       "B   -1024\n",
       "C   -1024\n",
       "D   -1024\n",
       "E    1024\n",
       "F   -1024\n",
       "H   -1024\n",
       "I   -1024\n",
       "J    1024\n",
       "K   -1024\n",
       "Name: Python, dtype: int64"
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 可以针对一列数据，Series进行运算\n",
    "df[\"Python\"].transform(lambda x: 1024 if x > 5 else -1024)  # 这个功能和map，apply类似的"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-11-29T09:44:00.848131Z",
     "iopub.status.busy": "2024-11-29T09:44:00.848131Z",
     "iopub.status.idle": "2024-11-29T09:44:00.862067Z",
     "shell.execute_reply": "2024-11-29T09:44:00.862067Z",
     "shell.execute_reply.started": "2024-11-29T09:44:00.848131Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>sqrt</th>\n",
       "      <th>square</th>\n",
       "      <th>cumsum</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>A</th>\n",
       "      <td>1.732051</td>\n",
       "      <td>9</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>B</th>\n",
       "      <td>3.000000</td>\n",
       "      <td>81</td>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C</th>\n",
       "      <td>2.000000</td>\n",
       "      <td>16</td>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>D</th>\n",
       "      <td>2.236068</td>\n",
       "      <td>25</td>\n",
       "      <td>21</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>E</th>\n",
       "      <td>2.000000</td>\n",
       "      <td>16</td>\n",
       "      <td>25</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>F</th>\n",
       "      <td>1.000000</td>\n",
       "      <td>1</td>\n",
       "      <td>26</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>H</th>\n",
       "      <td>2.000000</td>\n",
       "      <td>16</td>\n",
       "      <td>30</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>I</th>\n",
       "      <td>2.828427</td>\n",
       "      <td>64</td>\n",
       "      <td>38</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>J</th>\n",
       "      <td>1.414214</td>\n",
       "      <td>4</td>\n",
       "      <td>40</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>K</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0</td>\n",
       "      <td>40</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       sqrt  square  cumsum\n",
       "A  1.732051       9       3\n",
       "B  3.000000      81      12\n",
       "C  2.000000      16      16\n",
       "D  2.236068      25      21\n",
       "E  2.000000      16      25\n",
       "F  1.000000       1      26\n",
       "H  2.000000      16      30\n",
       "I  2.828427      64      38\n",
       "J  1.414214       4      40\n",
       "K  0.000000       0      40"
      ]
     },
     "execution_count": 48,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#                       开平方\t\t平方       累加和\n",
    "df[\"Tensorflow\"].transform([np.sqrt, np.square, \"cumsum\"])  # 针对一列，进行不同的操作"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-11-29T09:45:02.297048Z",
     "iopub.status.busy": "2024-11-29T09:45:02.296045Z",
     "iopub.status.idle": "2024-11-29T09:45:02.317707Z",
     "shell.execute_reply": "2024-11-29T09:45:02.317707Z",
     "shell.execute_reply.started": "2024-11-29T09:45:02.297048Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Python</th>\n",
       "      <th>Tensorflow</th>\n",
       "      <th>Keras</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>A</th>\n",
       "      <td>0</td>\n",
       "      <td>9</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>B</th>\n",
       "      <td>2</td>\n",
       "      <td>81</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C</th>\n",
       "      <td>6</td>\n",
       "      <td>16</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>D</th>\n",
       "      <td>9</td>\n",
       "      <td>25</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>E</th>\n",
       "      <td>18</td>\n",
       "      <td>16</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>F</th>\n",
       "      <td>18</td>\n",
       "      <td>1</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>H</th>\n",
       "      <td>21</td>\n",
       "      <td>16</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>I</th>\n",
       "      <td>23</td>\n",
       "      <td>64</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>J</th>\n",
       "      <td>29</td>\n",
       "      <td>4</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>K</th>\n",
       "      <td>31</td>\n",
       "      <td>0</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Python  Tensorflow  Keras\n",
       "A       0           9  False\n",
       "B       2          81  False\n",
       "C       6          16  False\n",
       "D       9          25   True\n",
       "E      18          16  False\n",
       "F      18           1   True\n",
       "H      21          16   True\n",
       "I      23          64  False\n",
       "J      29           4  False\n",
       "K      31           0  False"
      ]
     },
     "execution_count": 49,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def convert(x):\n",
    "    if x > 5:\n",
    "        return True\n",
    "    else:\n",
    "        return False\n",
    "\n",
    "\n",
    "# 可以针对DataFrame进行运算\n",
    "df.transform(\n",
    "    {\"Python\": \"cumsum\", \"Tensorflow\": np.square, \"Keras\": convert}\n",
    ")  # 对不同的列，执行不同的操作"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "## 替换值"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0       1\n",
       "1    -999\n",
       "2       2\n",
       "3   -1000\n",
       "4       3\n",
       "dtype: int64"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data = pd.Series([1,-999,2,-1000,3])\n",
    "data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0       1.0\n",
       "1       NaN\n",
       "2       2.0\n",
       "3   -1000.0\n",
       "4       3.0\n",
       "dtype: float64"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 将-999替换为nan\n",
    "data.replace(-999,np.nan)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    1.0\n",
       "1    NaN\n",
       "2    2.0\n",
       "3    NaN\n",
       "4    3.0\n",
       "dtype: float64"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 将-999和-1000替换为nan\n",
    "data.replace([-999,-1000],np.nan)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    1.0\n",
       "1    NaN\n",
       "2    2.0\n",
       "3    0.0\n",
       "4    3.0\n",
       "dtype: float64"
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 将-999替换为nan，将-1000替换为0\n",
    "data1 = data.replace([-999,-1000],[np.nan,0])\n",
    "data1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    1.0\n",
       "1    NaN\n",
       "2    2.0\n",
       "3    0.0\n",
       "4    3.0\n",
       "dtype: float64"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 将-999替换为nan，将-1000替换为0\n",
    "# data.replace和data.str.replace 是不一样的\n",
    "data.replace({-999:np.nan,-1000:0})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "## 重命名轴索引"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>one</th>\n",
       "      <th>two</th>\n",
       "      <th>three</th>\n",
       "      <th>four</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>BeiJing</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Tokyo</th>\n",
       "      <td>4</td>\n",
       "      <td>5</td>\n",
       "      <td>6</td>\n",
       "      <td>7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>New York</th>\n",
       "      <td>8</td>\n",
       "      <td>9</td>\n",
       "      <td>10</td>\n",
       "      <td>11</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          one  two  three  four\n",
       "BeiJing     0    1      2     3\n",
       "Tokyo       4    5      6     7\n",
       "New York    8    9     10    11"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data = pd.DataFrame(np.arange(12).reshape((3, 4)),\n",
    "                    index=['BeiJing', 'Tokyo', 'New York'],\n",
    "                    columns=['one', 'two', 'three', 'four'])\n",
    "data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>one</th>\n",
       "      <th>two</th>\n",
       "      <th>three</th>\n",
       "      <th>four</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>a</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>b</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>c</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   one  two  three  four\n",
       "a  NaN  NaN    NaN   NaN\n",
       "b  NaN  NaN    NaN   NaN\n",
       "c  NaN  NaN    NaN   NaN"
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 重新索引 改变原有索引的前后顺序，但不能起新的索引名，\n",
    "# 如果使用了新的索引名，就会变成下面情况，为nan\n",
    "data.reindex(['a', 'b', 'c'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['BEIJ', 'TOKY', 'NEW '], dtype='object')"
      ]
     },
     "execution_count": 46,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#大写\n",
    "tran = lambda x:x[:4].upper()\n",
    "data.index.map(tran)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>one</th>\n",
       "      <th>two</th>\n",
       "      <th>three</th>\n",
       "      <th>four</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>BEIJ</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TOKY</th>\n",
       "      <td>4</td>\n",
       "      <td>5</td>\n",
       "      <td>6</td>\n",
       "      <td>7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>NEW</th>\n",
       "      <td>8</td>\n",
       "      <td>9</td>\n",
       "      <td>10</td>\n",
       "      <td>11</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      one  two  three  four\n",
       "BEIJ    0    1      2     3\n",
       "TOKY    4    5      6     7\n",
       "NEW     8    9     10    11"
      ]
     },
     "execution_count": 47,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.index = data.index.map(tran)\n",
    "data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ONE</th>\n",
       "      <th>TWO</th>\n",
       "      <th>THREE</th>\n",
       "      <th>FOUR</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Beij</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Toky</th>\n",
       "      <td>4</td>\n",
       "      <td>5</td>\n",
       "      <td>6</td>\n",
       "      <td>7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>New</th>\n",
       "      <td>8</td>\n",
       "      <td>9</td>\n",
       "      <td>10</td>\n",
       "      <td>11</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      ONE  TWO  THREE  FOUR\n",
       "Beij    0    1      2     3\n",
       "Toky    4    5      6     7\n",
       "New     8    9     10    11"
      ]
     },
     "execution_count": 48,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# rename 不修改源数据，会返回一个新的对象\n",
    "data.rename(index=str.title,columns = str.upper)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>one</th>\n",
       "      <th>two</th>\n",
       "      <th>第三年</th>\n",
       "      <th>four</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>BEIJ</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>东京</th>\n",
       "      <td>4</td>\n",
       "      <td>5</td>\n",
       "      <td>6</td>\n",
       "      <td>7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>NEW</th>\n",
       "      <td>8</td>\n",
       "      <td>9</td>\n",
       "      <td>10</td>\n",
       "      <td>11</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      one  two  第三年  four\n",
       "BEIJ    0    1    2     3\n",
       "东京      4    5    6     7\n",
       "NEW     8    9   10    11"
      ]
     },
     "execution_count": 49,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#结合字典型对象对标签更新\n",
    "data.rename(index={'TOKY':'东京'},columns={'three':'第三年'})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>one</th>\n",
       "      <th>two</th>\n",
       "      <th>第三年</th>\n",
       "      <th>four</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>BEIJ</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>东京</th>\n",
       "      <td>4</td>\n",
       "      <td>5</td>\n",
       "      <td>6</td>\n",
       "      <td>7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>NEW</th>\n",
       "      <td>8</td>\n",
       "      <td>9</td>\n",
       "      <td>10</td>\n",
       "      <td>11</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      one  two  第三年  four\n",
       "BEIJ    0    1    2     3\n",
       "东京      4    5    6     7\n",
       "NEW     8    9   10    11"
      ]
     },
     "execution_count": 50,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.rename(index={'TOKY':'东京'},columns={'three':'第三年'},inplace = True)\n",
    "data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "## 分箱，即离散化和面元划分\n",
    "离散化：不关心数字的绝对值的大小，关心其他值比较的相对大小\n",
    "面元划分：分阶段"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false,
    "execution": {
     "iopub.execute_input": "2024-11-29T10:09:20.925412Z",
     "iopub.status.busy": "2024-11-29T10:09:20.925412Z",
     "iopub.status.idle": "2024-11-29T10:09:20.943263Z",
     "shell.execute_reply": "2024-11-29T10:09:20.943263Z",
     "shell.execute_reply.started": "2024-11-29T10:09:20.925412Z"
    },
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]\n",
       "Length: 12\n",
       "Categories (4, interval[int64, right]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]\n",
    "# 面元的英文翻译= bin\n",
    "# 按照以下的这些值进行划分，分组，分为4组（左开右闭）\n",
    "bins = [18,25,35,60,100]\n",
    "cats = pd.cut(ages,bins)\n",
    "cats"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {
    "collapsed": false,
    "execution": {
     "iopub.execute_input": "2024-11-29T10:06:27.489028Z",
     "iopub.status.busy": "2024-11-29T10:06:27.489028Z",
     "iopub.status.idle": "2024-11-29T10:06:27.495545Z",
     "shell.execute_reply": "2024-11-29T10:06:27.495545Z",
     "shell.execute_reply.started": "2024-11-29T10:06:27.489028Z"
    },
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)"
      ]
     },
     "execution_count": 66,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cats.codes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {
    "collapsed": false,
    "execution": {
     "iopub.execute_input": "2024-11-29T10:06:28.905244Z",
     "iopub.status.busy": "2024-11-29T10:06:28.905244Z",
     "iopub.status.idle": "2024-11-29T10:06:28.912030Z",
     "shell.execute_reply": "2024-11-29T10:06:28.912030Z",
     "shell.execute_reply.started": "2024-11-29T10:06:28.905244Z"
    },
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]], dtype='interval[int64, right]')"
      ]
     },
     "execution_count": 67,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cats.categories"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "C:\\Users\\98614\\AppData\\Local\\Temp\\ipykernel_12436\\1485279302.py:1: FutureWarning: pandas.value_counts is deprecated and will be removed in a future version. Use pd.Series(obj).value_counts() instead.\n",
      "  pd.value_counts(cats)\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "(18, 25]     5\n",
       "(25, 35]     3\n",
       "(35, 60]     3\n",
       "(60, 100]    1\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 54,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.value_counts(cats)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]\n",
       "Length: 12\n",
       "Categories (4, interval[int64, left]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]"
      ]
     },
     "execution_count": 55,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 按照左闭右开进行分组\n",
    "pd.cut(ages,[18,26,36,61,100],right=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['青年', '青年', '青年', '年轻人', '青年', ..., '年轻人', '老年', '中年', '中年', '年轻人']\n",
       "Length: 12\n",
       "Categories (4, object): ['青年' < '年轻人' < '中年' < '老年']"
      ]
     },
     "execution_count": 56,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "names = ['青年','年轻人','中年','老年']\n",
    "pd.cut(ages,bins,labels=names)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0.12215183, 0.43826004, 0.46309567, 0.15113623, 0.96156624,\n",
       "       0.80866036, 0.37999692, 0.19352914, 0.58757469, 0.84196798,\n",
       "       0.88651405, 0.73552967, 0.03945823, 0.2032477 , 0.29898101,\n",
       "       0.44173128, 0.79636009, 0.47945521, 0.96159697, 0.02057208])"
      ]
     },
     "execution_count": 57,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data = np.random.rand(20)\n",
    "data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(0.02, 0.26], (0.26, 0.49], (0.26, 0.49], (0.02, 0.26], (0.73, 0.96], ..., (0.26, 0.49], (0.73, 0.96], (0.26, 0.49], (0.73, 0.96], (0.02, 0.26]]\n",
       "Length: 20\n",
       "Categories (4, interval[float64, right]): [(0.02, 0.26] < (0.26, 0.49] < (0.49, 0.73] < (0.73, 0.96]]"
      ]
     },
     "execution_count": 58,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 指定了分组个数为4，会按数值的最大和最小值分成4个等长的面元\n",
    "pd.cut(data,4,precision = 2)   #小数位数2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {
    "collapsed": false,
    "execution": {
     "iopub.execute_input": "2024-11-29T10:02:32.955725Z",
     "iopub.status.busy": "2024-11-29T10:02:32.955725Z",
     "iopub.status.idle": "2024-11-29T10:02:32.967032Z",
     "shell.execute_reply": "2024-11-29T10:02:32.967032Z",
     "shell.execute_reply.started": "2024-11-29T10:02:32.955725Z"
    },
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([-0.18761943,  0.37297001,  0.05074368, -0.26718791,  0.98470898,\n",
       "       -1.02545756,  0.97694465,  0.30876132,  0.46213364,  1.29540278,\n",
       "        0.80884017, -0.67308681, -0.10956084, -1.38643953, -0.55157773,\n",
       "       -0.43348111, -0.77524503, -1.38923929,  1.84964289,  0.41107935,\n",
       "        0.6768769 , -0.54324481, -0.77525195, -0.91304312, -0.92085869,\n",
       "        1.11128522,  0.66632717, -0.58084622, -0.50226364,  0.11487268,\n",
       "        1.23468185, -0.96156078,  0.59629221, -0.14822693, -1.08618225,\n",
       "       -0.66890762, -0.56717191, -1.60825738,  0.12519017,  0.70228449,\n",
       "       -0.42076862,  3.06028078, -0.36648796, -0.90457939,  1.06218885,\n",
       "       -0.88833568, -0.34875116, -0.39644928, -0.47878824, -0.59911383,\n",
       "       -1.49113051,  0.15310318,  0.13806463, -0.41787798,  0.52193529,\n",
       "        0.1252243 , -1.02795173, -0.02559418,  1.13745872, -0.99192926,\n",
       "        0.63063423, -0.48395388, -0.34187911,  0.10284009, -1.26774816,\n",
       "       -1.3833798 , -0.38528402,  0.45203066, -0.85616218, -0.48876656,\n",
       "        0.47068996,  0.70623936,  0.3324506 , -0.1463777 , -0.31104614,\n",
       "       -1.51127762,  1.24608393,  0.60098486,  1.21390242,  0.21730012,\n",
       "        1.04079875,  0.13939846, -1.89218844,  1.1828825 ,  0.96040777,\n",
       "       -0.06079249,  0.76514501,  1.13157204,  0.07669515, -0.76797882,\n",
       "        0.17820872, -0.50633264, -0.10474705,  1.23821979, -0.98072106,\n",
       "        0.23518468,  1.61784762, -0.25154988, -0.57957937, -0.83841859])"
      ]
     },
     "execution_count": 54,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data = np.random.randn(100)\n",
    "data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {
    "collapsed": false,
    "execution": {
     "iopub.execute_input": "2024-11-29T10:02:35.536296Z",
     "iopub.status.busy": "2024-11-29T10:02:35.536296Z",
     "iopub.status.idle": "2024-11-29T10:02:35.550598Z",
     "shell.execute_reply": "2024-11-29T10:02:35.549440Z",
     "shell.execute_reply.started": "2024-11-29T10:02:35.536296Z"
    },
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "C:\\Users\\98614\\AppData\\Local\\Temp\\ipykernel_21444\\1714426358.py:3: FutureWarning: pandas.value_counts is deprecated and will be removed in a future version. Use pd.Series(obj).value_counts() instead.\n",
      "  pd.value_counts(cats)\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "(-1.8929999999999998, -0.617]    25\n",
       "(-0.617, -0.128]                 25\n",
       "(-0.128, 0.597]                  25\n",
       "(0.597, 3.06]                    25\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 55,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# qcut函数，能够得到每个分组的数据个数是相等的\n",
    "cats = pd.qcut(data,4)\n",
    "pd.value_counts(cats)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {
    "collapsed": false,
    "execution": {
     "iopub.execute_input": "2024-11-29T10:02:37.471704Z",
     "iopub.status.busy": "2024-11-29T10:02:37.470530Z",
     "iopub.status.idle": "2024-11-29T10:02:37.489591Z",
     "shell.execute_reply": "2024-11-29T10:02:37.488499Z",
     "shell.execute_reply.started": "2024-11-29T10:02:37.471704Z"
    },
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "C:\\Users\\98614\\AppData\\Local\\Temp\\ipykernel_21444\\3852803508.py:3: FutureWarning: pandas.value_counts is deprecated and will be removed in a future version. Use pd.Series(obj).value_counts() instead.\n",
      "  pd.value_counts(cats)\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "(-1.026, -0.128]                 40\n",
       "(-0.128, 1.132]                  40\n",
       "(-1.8929999999999998, -1.026]    10\n",
       "(1.132, 3.06]                    10\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 56,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 按照区间的大小所占整个的百分比，来划分每个面元里的数据个数\n",
    "cats = pd.qcut(data,[0,0.1,0.5,0.9,1.])\n",
    "pd.value_counts(cats)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-11-29T10:10:01.749866Z",
     "iopub.status.busy": "2024-11-29T10:10:01.749866Z",
     "iopub.status.idle": "2024-11-29T10:10:01.769971Z",
     "shell.execute_reply": "2024-11-29T10:10:01.768970Z",
     "shell.execute_reply.started": "2024-11-29T10:10:01.749866Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Python</th>\n",
       "      <th>Math</th>\n",
       "      <th>En</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>30</td>\n",
       "      <td>118</td>\n",
       "      <td>109</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>42</td>\n",
       "      <td>121</td>\n",
       "      <td>93</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>131</td>\n",
       "      <td>108</td>\n",
       "      <td>129</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>119</td>\n",
       "      <td>117</td>\n",
       "      <td>146</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2</td>\n",
       "      <td>106</td>\n",
       "      <td>110</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>95</th>\n",
       "      <td>15</td>\n",
       "      <td>105</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>96</th>\n",
       "      <td>134</td>\n",
       "      <td>44</td>\n",
       "      <td>40</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>97</th>\n",
       "      <td>135</td>\n",
       "      <td>76</td>\n",
       "      <td>116</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>98</th>\n",
       "      <td>92</td>\n",
       "      <td>6</td>\n",
       "      <td>85</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>99</th>\n",
       "      <td>38</td>\n",
       "      <td>63</td>\n",
       "      <td>27</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>100 rows × 3 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "    Python  Math   En\n",
       "0       30   118  109\n",
       "1       42   121   93\n",
       "2      131   108  129\n",
       "3      119   117  146\n",
       "4        2   106  110\n",
       "..     ...   ...  ...\n",
       "95      15   105    4\n",
       "96     134    44   40\n",
       "97     135    76  116\n",
       "98      92     6   85\n",
       "99      38    63   27\n",
       "\n",
       "[100 rows x 3 columns]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.DataFrame(\n",
    "    np.random.randint(0, 151, size=(100, 3)), columns=[\"Python\", \"Math\", \"En\"]\n",
    ")\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-11-29T10:15:50.500687Z",
     "iopub.status.busy": "2024-11-29T10:15:50.500687Z",
     "iopub.status.idle": "2024-11-29T10:15:50.519977Z",
     "shell.execute_reply": "2024-11-29T10:15:50.519977Z",
     "shell.execute_reply.started": "2024-11-29T10:15:50.500687Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0     不及格\n",
       "1      及格\n",
       "2      优秀\n",
       "3      优秀\n",
       "4     不及格\n",
       "     ... \n",
       "95    不及格\n",
       "96     优秀\n",
       "97     优秀\n",
       "98     中等\n",
       "99    不及格\n",
       "Name: Python, Length: 100, dtype: category\n",
       "Categories (4, object): ['不及格' < '及格' < '中等' < '优秀']"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 等宽\n",
    "score = pd.cut(\n",
    "    df.Python, bins=4, labels=[\"不及格\", \"及格\", \"中等\", \"优秀\"], right=True  # 等宽分成四份\n",
    ")  # 区间，闭区间\n",
    "score"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "editable": true,
    "execution": {
     "iopub.execute_input": "2024-11-29T10:15:47.470206Z",
     "iopub.status.busy": "2024-11-29T10:15:47.469197Z",
     "iopub.status.idle": "2024-11-29T10:15:47.489571Z",
     "shell.execute_reply": "2024-11-29T10:15:47.488205Z",
     "shell.execute_reply.started": "2024-11-29T10:15:47.470206Z"
    },
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['不及格', '及格', '中等', '优秀'], dtype='object')"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "score.cat.categories"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-11-29T10:16:46.533599Z",
     "iopub.status.busy": "2024-11-29T10:16:46.532596Z",
     "iopub.status.idle": "2024-11-29T10:16:46.550660Z",
     "shell.execute_reply": "2024-11-29T10:16:46.549657Z",
     "shell.execute_reply.started": "2024-11-29T10:16:46.533599Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0      (1.852, 39.0]\n",
       "1       (39.0, 76.0]\n",
       "2     (113.0, 150.0]\n",
       "3     (113.0, 150.0]\n",
       "4      (1.852, 39.0]\n",
       "           ...      \n",
       "95     (1.852, 39.0]\n",
       "96    (113.0, 150.0]\n",
       "97    (113.0, 150.0]\n",
       "98     (76.0, 113.0]\n",
       "99     (1.852, 39.0]\n",
       "Name: Python, Length: 100, dtype: category\n",
       "Categories (4, interval[float64, right]): [(1.852, 39.0] < (39.0, 76.0] < (76.0, 113.0] < (113.0, 150.0]]"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 等宽\n",
    "pd.cut(\n",
    "    df.Python, \n",
    "    bins=4, \n",
    "#     labels=[\"不及格\", \"及格\", \"中等\", \"优秀\"], \n",
    "    right=True \n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-11-29T10:52:01.407516Z",
     "iopub.status.busy": "2024-11-29T10:52:01.407516Z",
     "iopub.status.idle": "2024-11-29T10:52:01.417775Z",
     "shell.execute_reply": "2024-11-29T10:52:01.417775Z",
     "shell.execute_reply.started": "2024-11-29T10:52:01.407516Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Python\n",
       "不及格    31\n",
       "及格     28\n",
       "优秀     24\n",
       "中等     17\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.cut(\n",
    "    df.Python, bins=4, labels=[\"不及格\", \"及格\", \"中等\", \"优秀\"], right=True  # 等宽分成四份\n",
    ").value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-11-29T10:52:09.377893Z",
     "iopub.status.busy": "2024-11-29T10:52:09.376890Z",
     "iopub.status.idle": "2024-11-29T10:52:09.386021Z",
     "shell.execute_reply": "2024-11-29T10:52:09.386021Z",
     "shell.execute_reply.started": "2024-11-29T10:52:09.377893Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Python</th>\n",
       "      <th>Math</th>\n",
       "      <th>En</th>\n",
       "      <th>等级</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>30</td>\n",
       "      <td>118</td>\n",
       "      <td>109</td>\n",
       "      <td>极差</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>42</td>\n",
       "      <td>121</td>\n",
       "      <td>93</td>\n",
       "      <td>不及格</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>131</td>\n",
       "      <td>108</td>\n",
       "      <td>129</td>\n",
       "      <td>优秀</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>119</td>\n",
       "      <td>117</td>\n",
       "      <td>146</td>\n",
       "      <td>中等</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2</td>\n",
       "      <td>106</td>\n",
       "      <td>110</td>\n",
       "      <td>极差</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>95</th>\n",
       "      <td>15</td>\n",
       "      <td>105</td>\n",
       "      <td>4</td>\n",
       "      <td>极差</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>96</th>\n",
       "      <td>134</td>\n",
       "      <td>44</td>\n",
       "      <td>40</td>\n",
       "      <td>优秀</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>97</th>\n",
       "      <td>135</td>\n",
       "      <td>76</td>\n",
       "      <td>116</td>\n",
       "      <td>优秀</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>98</th>\n",
       "      <td>92</td>\n",
       "      <td>6</td>\n",
       "      <td>85</td>\n",
       "      <td>中等</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>99</th>\n",
       "      <td>38</td>\n",
       "      <td>63</td>\n",
       "      <td>27</td>\n",
       "      <td>不及格</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>100 rows × 4 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "    Python  Math   En   等级\n",
       "0       30   118  109   极差\n",
       "1       42   121   93  不及格\n",
       "2      131   108  129   优秀\n",
       "3      119   117  146   中等\n",
       "4        2   106  110   极差\n",
       "..     ...   ...  ...  ...\n",
       "95      15   105    4   极差\n",
       "96     134    44   40   优秀\n",
       "97     135    76  116   优秀\n",
       "98      92     6   85   中等\n",
       "99      38    63   27  不及格\n",
       "\n",
       "[100 rows x 4 columns]"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[\"等级\"] = pd.cut(\n",
    "    df.Python, bins=[0, 30, 60, 90, 120, 150], labels=[\"极差\", \"不及格\", \"及格\", \"中等\", \"优秀\"]\n",
    ")\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-11-29T10:53:09.469215Z",
     "iopub.status.busy": "2024-11-29T10:53:09.467754Z",
     "iopub.status.idle": "2024-11-29T10:53:09.488765Z",
     "shell.execute_reply": "2024-11-29T10:53:09.487684Z",
     "shell.execute_reply.started": "2024-11-29T10:53:09.469215Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Python\n",
       "不及格    28\n",
       "中等     25\n",
       "优秀     25\n",
       "及格     22\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 等频\n",
    "pd.qcut(df.Python, q=4, labels=[\"不及格\", \"及格\", \"中等\", \"优秀\"]).value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "## 检测和过滤异常值"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.255005</td>\n",
       "      <td>-0.070266</td>\n",
       "      <td>-0.216659</td>\n",
       "      <td>-0.920589</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>-0.007590</td>\n",
       "      <td>-1.585268</td>\n",
       "      <td>-0.171354</td>\n",
       "      <td>-0.333173</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>-0.599092</td>\n",
       "      <td>0.457906</td>\n",
       "      <td>0.047819</td>\n",
       "      <td>-0.517866</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>-0.533756</td>\n",
       "      <td>0.355982</td>\n",
       "      <td>-0.719796</td>\n",
       "      <td>1.224066</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.952753</td>\n",
       "      <td>0.517998</td>\n",
       "      <td>-0.851118</td>\n",
       "      <td>0.994973</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>995</th>\n",
       "      <td>-1.936705</td>\n",
       "      <td>0.131146</td>\n",
       "      <td>0.175184</td>\n",
       "      <td>1.422118</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>996</th>\n",
       "      <td>0.751033</td>\n",
       "      <td>-0.036077</td>\n",
       "      <td>1.535488</td>\n",
       "      <td>-0.191239</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>997</th>\n",
       "      <td>0.160412</td>\n",
       "      <td>1.118875</td>\n",
       "      <td>-0.617703</td>\n",
       "      <td>-0.822643</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>998</th>\n",
       "      <td>-1.289580</td>\n",
       "      <td>-0.217603</td>\n",
       "      <td>-1.717825</td>\n",
       "      <td>0.082487</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>999</th>\n",
       "      <td>-0.293560</td>\n",
       "      <td>0.741841</td>\n",
       "      <td>-1.438634</td>\n",
       "      <td>0.216693</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>1000 rows × 4 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "            0         1         2         3\n",
       "0    0.255005 -0.070266 -0.216659 -0.920589\n",
       "1   -0.007590 -1.585268 -0.171354 -0.333173\n",
       "2   -0.599092  0.457906  0.047819 -0.517866\n",
       "3   -0.533756  0.355982 -0.719796  1.224066\n",
       "4    0.952753  0.517998 -0.851118  0.994973\n",
       "..        ...       ...       ...       ...\n",
       "995 -1.936705  0.131146  0.175184  1.422118\n",
       "996  0.751033 -0.036077  1.535488 -0.191239\n",
       "997  0.160412  1.118875 -0.617703 -0.822643\n",
       "998 -1.289580 -0.217603 -1.717825  0.082487\n",
       "999 -0.293560  0.741841 -1.438634  0.216693\n",
       "\n",
       "[1000 rows x 4 columns]"
      ]
     },
     "execution_count": 62,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data = pd.DataFrame(np.random.randn(1000, 4))\n",
    "data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>-0.491725</td>\n",
       "      <td>3.357212</td>\n",
       "      <td>0.324261</td>\n",
       "      <td>0.555546</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>192</th>\n",
       "      <td>0.554304</td>\n",
       "      <td>-3.275391</td>\n",
       "      <td>-1.954480</td>\n",
       "      <td>-1.015726</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>355</th>\n",
       "      <td>2.325047</td>\n",
       "      <td>1.141355</td>\n",
       "      <td>-0.065538</td>\n",
       "      <td>3.808601</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>401</th>\n",
       "      <td>1.039486</td>\n",
       "      <td>3.170118</td>\n",
       "      <td>-1.018711</td>\n",
       "      <td>1.995579</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>539</th>\n",
       "      <td>-0.362203</td>\n",
       "      <td>-3.166418</td>\n",
       "      <td>-0.191452</td>\n",
       "      <td>-0.211393</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>577</th>\n",
       "      <td>0.157298</td>\n",
       "      <td>-3.194854</td>\n",
       "      <td>0.618818</td>\n",
       "      <td>-1.744634</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>786</th>\n",
       "      <td>-0.802897</td>\n",
       "      <td>-1.363597</td>\n",
       "      <td>2.114546</td>\n",
       "      <td>3.249602</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>846</th>\n",
       "      <td>-3.294286</td>\n",
       "      <td>-0.052303</td>\n",
       "      <td>-0.685119</td>\n",
       "      <td>-2.300158</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>848</th>\n",
       "      <td>1.220413</td>\n",
       "      <td>0.068838</td>\n",
       "      <td>0.171057</td>\n",
       "      <td>-3.055634</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>912</th>\n",
       "      <td>0.204733</td>\n",
       "      <td>-0.288201</td>\n",
       "      <td>3.102370</td>\n",
       "      <td>-0.568155</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>983</th>\n",
       "      <td>3.448719</td>\n",
       "      <td>-2.701071</td>\n",
       "      <td>1.030714</td>\n",
       "      <td>-0.550708</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            0         1         2         3\n",
       "12  -0.491725  3.357212  0.324261  0.555546\n",
       "192  0.554304 -3.275391 -1.954480 -1.015726\n",
       "355  2.325047  1.141355 -0.065538  3.808601\n",
       "401  1.039486  3.170118 -1.018711  1.995579\n",
       "539 -0.362203 -3.166418 -0.191452 -0.211393\n",
       "577  0.157298 -3.194854  0.618818 -1.744634\n",
       "786 -0.802897 -1.363597  2.114546  3.249602\n",
       "846 -3.294286 -0.052303 -0.685119 -2.300158\n",
       "848  1.220413  0.068838  0.171057 -3.055634\n",
       "912  0.204733 -0.288201  3.102370 -0.568155\n",
       "983  3.448719 -2.701071  1.030714 -0.550708"
      ]
     },
     "execution_count": 64,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# .any(axis=1) 如果哪一行上出现了绝对值大于3的数据，就取到该行\n",
    "data[(np.abs(data) > 3).any(axis=1)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>1000.000000</td>\n",
       "      <td>1000.000000</td>\n",
       "      <td>1000.000000</td>\n",
       "      <td>1000.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>0.007760</td>\n",
       "      <td>0.002842</td>\n",
       "      <td>0.043337</td>\n",
       "      <td>0.011245</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>1.010928</td>\n",
       "      <td>1.025304</td>\n",
       "      <td>1.005955</td>\n",
       "      <td>1.022716</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>-2.899164</td>\n",
       "      <td>-2.829475</td>\n",
       "      <td>-2.792239</td>\n",
       "      <td>-2.738879</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>-0.707535</td>\n",
       "      <td>-0.686161</td>\n",
       "      <td>-0.658122</td>\n",
       "      <td>-0.732579</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>0.038936</td>\n",
       "      <td>-0.037068</td>\n",
       "      <td>0.014515</td>\n",
       "      <td>-0.016868</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>0.722471</td>\n",
       "      <td>0.689378</td>\n",
       "      <td>0.741652</td>\n",
       "      <td>0.693880</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>3.000000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>3.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                 0            1            2            3\n",
       "count  1000.000000  1000.000000  1000.000000  1000.000000\n",
       "mean      0.007760     0.002842     0.043337     0.011245\n",
       "std       1.010928     1.025304     1.005955     1.022716\n",
       "min      -2.899164    -2.829475    -2.792239    -2.738879\n",
       "25%      -0.707535    -0.686161    -0.658122    -0.732579\n",
       "50%       0.038936    -0.037068     0.014515    -0.016868\n",
       "75%       0.722471     0.689378     0.741652     0.693880\n",
       "max       3.000000     3.000000     3.000000     3.000000"
      ]
     },
     "execution_count": 65,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data[np.abs(data)>3] = 3\n",
    "data.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "## 排列和随机采样"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4</td>\n",
       "      <td>5</td>\n",
       "      <td>6</td>\n",
       "      <td>7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>8</td>\n",
       "      <td>9</td>\n",
       "      <td>10</td>\n",
       "      <td>11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>12</td>\n",
       "      <td>13</td>\n",
       "      <td>14</td>\n",
       "      <td>15</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>16</td>\n",
       "      <td>17</td>\n",
       "      <td>18</td>\n",
       "      <td>19</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    0   1   2   3\n",
       "0   0   1   2   3\n",
       "1   4   5   6   7\n",
       "2   8   9  10  11\n",
       "3  12  13  14  15\n",
       "4  16  17  18  19"
      ]
     },
     "execution_count": 66,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([3, 2, 4, 1, 0], dtype=int32)"
      ]
     },
     "execution_count": 67,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 随机顺序的数组\n",
    "#新顺序数组\n",
    "sam = np.random.permutation(5)\n",
    "sam"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>12</td>\n",
       "      <td>13</td>\n",
       "      <td>14</td>\n",
       "      <td>15</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>8</td>\n",
       "      <td>9</td>\n",
       "      <td>10</td>\n",
       "      <td>11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>16</td>\n",
       "      <td>17</td>\n",
       "      <td>18</td>\n",
       "      <td>19</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4</td>\n",
       "      <td>5</td>\n",
       "      <td>6</td>\n",
       "      <td>7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    0   1   2   3\n",
       "3  12  13  14  15\n",
       "2   8   9  10  11\n",
       "4  16  17  18  19\n",
       "1   4   5   6   7\n",
       "0   0   1   2   3"
      ]
     },
     "execution_count": 68,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#排列\n",
    "df.take(sam)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>12</td>\n",
       "      <td>13</td>\n",
       "      <td>14</td>\n",
       "      <td>15</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    0   1   2   3\n",
       "3  12  13  14  15"
      ]
     },
     "execution_count": 69,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 随机选取n行\n",
    "df.sample(n=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    5\n",
       "1    7\n",
       "2    1\n",
       "3    6\n",
       "4    3\n",
       "dtype: int64"
      ]
     },
     "execution_count": 70,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ch = pd.Series([5,7,1,6,3])\n",
    "ch"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1    7\n",
       "3    6\n",
       "4    3\n",
       "2    1\n",
       "4    3\n",
       "4    3\n",
       "2    1\n",
       "3    6\n",
       "4    3\n",
       "2    1\n",
       "dtype: int64"
      ]
     },
     "execution_count": 71,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 随机选取10个数据，replace为true表示会重复选取同一个值\n",
    "ch.sample(n=10,replace = True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "# 字符串操作"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "## 字符串对象方法\n",
    "|方法|\t说明|\n",
    "|---|---|\n",
    "|count|\t返回子串在字符串中的出现次数(非重叠)|\n",
    "|endswith. startswith| \t如果字符串 以某个后缀结尾(以某个前缀开头)，则返回True |\n",
    "|join\t|将字符串用作连接其他字符串序列的分隔符|\n",
    "|index\t|如果在字符串中找到子串，则返回子串第一个字符所在的位置。如果没有找到，则引发ValueError。|\n",
    "|find\t|如果在字符串中找到子串，则返回第一个发现的子串的第一个字符所在的位置。如果没有找到，则返回-1|\n",
    "|rfind\t|如果在字符串中找到子串，则返回最后一个发现的子串的第一个字符所在的位置。如果没有找到，则返回-1 |\n",
    "|replace|\t用另一个字符串替换指定子串|\n",
    "|strip、rstrip、 Istrip| \t去除空白符(包括换行符)。相当于对各个元素执行x.strip()(以及rstrip、Istrip) 。|\n",
    "|split\t|通过指定的分隔符将字符串拆分为- -组子串|\n",
    "|lower、upper|\t分别将字母字符转换为小写或大写.|\n",
    "|ljust、rjust|\t用空格(或其他字符)填充字符串的空白侧以返回符合最低宽度的字符串|\n",
    "\n",
    "## 正则表达式\n",
    "|方法\t|说明|\n",
    "|---|---|\n",
    "|findall、finditer\t|返回字符串中所东的非重叠匹配模式。findall返回的是由所有模式组成的列表，而finditer则通过一个迭代器逐个返回|\n",
    "|match|\t从字符串起始位置匹配模式，还可以对模式各部分进行分组。如果匹配到模式，则返回一个匹配项对象，否则返回None|\n",
    "|search\t|扫描整个字符串以匹配模式。如果找到则返回一个匹配项对象。跟match不同，其匹配项可以位于字符串的任意位置，而不仅仅是起始处|\n",
    "|split\t|根据找到的模式将字符串拆分为数段|\n",
    "|sub、subn\t|将字符串中所有的 (sub)或前n个 (subn)模式替换为指定表达式。在替换字符串中可以通过\\1、\\2等符号表示各分组项|"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['foo', 'bar', 'bat', 'qq']"
      ]
     },
     "execution_count": 73,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import re\n",
    "text = \"foo  bar\\t bat  \\tqq\"\n",
    "re.split('\\s+',text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['foo', 'bar', 'bat', 'qq']"
      ]
     },
     "execution_count": 74,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "res = re.compile('\\s+')\n",
    "res.split(text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['foo', 'bar', 'bat', 'qq']"
      ]
     },
     "execution_count": 75,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "reg = re.split(res,text)\n",
    "reg"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['  ', '\\t ', '  \\t']"
      ]
     },
     "execution_count": 76,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "res.findall(text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 77,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['  ', '\\t ', '  \\t']"
      ]
     },
     "execution_count": 77,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "re.findall(res,text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 78,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'foo9bar9bat9qq'"
      ]
     },
     "execution_count": 78,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "re.sub(res,'9',text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "f\n",
      "b\n"
     ]
    }
   ],
   "source": [
    "t1 = re.match('f',text)\n",
    "print(t1.group())\n",
    "print(re.search('b',text).group())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "# pandas的矢量化字符串函数\n",
    "|方法|\t说明|\n",
    "|---|---|\n",
    "|cat|\t实现元素级的字符串连接操作，可指定分隔符|\n",
    "|count|\t返回表示个字符串是否含有指定模式的布尔型数组|\n",
    "|extract|\t使用带分组的正则表达式从字符串Series提取一个或多个字符串,结果是一个DataFrame,每组有一列|\n",
    "|endswith|\t相当于对每个元素执行x.endswith(pattern)|\n",
    "|startswith\t|相当于对每个元素执行x.startswith(pattern)|\n",
    "|findall|\t计算各字符串的模式列表|\n",
    "|get\t|获取各元素的第i个字符|\n",
    "|isalnum\t|相当于内置的str.alnum|\n",
    "|isalpha\t|相当于内置的str.isalpha|\n",
    "|isdecimal\t|相当于内置的str.isdecimal|\n",
    "|isdigit\t|相当于内置的str.isdigit|\n",
    "|islower\t|相当于内置的str.islower|\n",
    "|isupper\t|相当于内置的str.isupper|\n",
    "|join\t|根据指定的分隔符将Series 中各元素的字符串连接起来|\n",
    "|len|\t计算各字符串的长度|\n",
    "|lower,upper|\t转换大小写。相当于对各个元素执行x.lower()或x.upper()|\n",
    "|match\t|根据指定的正则表达式对各个元素执行re.match，返回匹配的组为列表|\n",
    "|pad|\t在字符串的左边、右边或两边添加空白符|\n",
    "|center|\t相当于pad(side='both')|\n",
    "|repeat|\t重复值。例如，s.str.repeat(3)相当于对各个字符串执行x*3|\n",
    "|replace|\t用指定字符串替换找到的模式|\n",
    "|slice|\t对Series中的各个字符串进行子串截取|\n",
    "|split|\t根据分隔符或正则表达式对字符串进行拆分|\n",
    "|strip|\t去除两边的空白符，包括新行|\n",
    "|rstrip\t|去除右边的空白符|"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "a        dave@qq.com\n",
       "b    steve@gmail.com\n",
       "c      sam@gmail.com\n",
       "d                NaN\n",
       "dtype: object"
      ]
     },
     "execution_count": 80,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data = {'a': 'dave@qq.com', 'b': 'steve@gmail.com',\n",
    "        'c': 'sam@gmail.com', 'd': np.nan}\n",
    "data = pd.Series(data)\n",
    "data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 81,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "a    False\n",
       "b    False\n",
       "c    False\n",
       "d     True\n",
       "dtype: bool"
      ]
     },
     "execution_count": 81,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.isnull()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 82,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "ename": "AttributeError",
     "evalue": "'float' object has no attribute 'split'",
     "output_type": "error",
     "traceback": [
      "\u001B[1;31m---------------------------------------------------------------------------\u001B[0m",
      "\u001B[1;31mAttributeError\u001B[0m                            Traceback (most recent call last)",
      "Cell \u001B[1;32mIn[82], line 1\u001B[0m\n\u001B[1;32m----> 1\u001B[0m \u001B[43mdata\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mmap\u001B[49m\u001B[43m(\u001B[49m\u001B[38;5;28;43;01mlambda\u001B[39;49;00m\u001B[43m \u001B[49m\u001B[43mx\u001B[49m\u001B[43m \u001B[49m\u001B[43m:\u001B[49m\u001B[43mx\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43msplit\u001B[49m\u001B[43m(\u001B[49m\u001B[38;5;124;43m'\u001B[39;49m\u001B[38;5;124;43m@\u001B[39;49m\u001B[38;5;124;43m'\u001B[39;49m\u001B[43m)\u001B[49m\u001B[43m)\u001B[49m\n",
      "File \u001B[1;32mD:\\soft_install\\anaconda\\anaconda3\\envs\\QiXiaoFu\\lib\\site-packages\\pandas\\core\\series.py:4700\u001B[0m, in \u001B[0;36mSeries.map\u001B[1;34m(self, arg, na_action)\u001B[0m\n\u001B[0;32m   4620\u001B[0m \u001B[38;5;28;01mdef\u001B[39;00m \u001B[38;5;21mmap\u001B[39m(\n\u001B[0;32m   4621\u001B[0m     \u001B[38;5;28mself\u001B[39m,\n\u001B[0;32m   4622\u001B[0m     arg: Callable \u001B[38;5;241m|\u001B[39m Mapping \u001B[38;5;241m|\u001B[39m Series,\n\u001B[0;32m   4623\u001B[0m     na_action: Literal[\u001B[38;5;124m\"\u001B[39m\u001B[38;5;124mignore\u001B[39m\u001B[38;5;124m\"\u001B[39m] \u001B[38;5;241m|\u001B[39m \u001B[38;5;28;01mNone\u001B[39;00m \u001B[38;5;241m=\u001B[39m \u001B[38;5;28;01mNone\u001B[39;00m,\n\u001B[0;32m   4624\u001B[0m ) \u001B[38;5;241m-\u001B[39m\u001B[38;5;241m>\u001B[39m Series:\n\u001B[0;32m   4625\u001B[0m \u001B[38;5;250m    \u001B[39m\u001B[38;5;124;03m\"\"\"\u001B[39;00m\n\u001B[0;32m   4626\u001B[0m \u001B[38;5;124;03m    Map values of Series according to an input mapping or function.\u001B[39;00m\n\u001B[0;32m   4627\u001B[0m \n\u001B[1;32m   (...)\u001B[0m\n\u001B[0;32m   4698\u001B[0m \u001B[38;5;124;03m    dtype: object\u001B[39;00m\n\u001B[0;32m   4699\u001B[0m \u001B[38;5;124;03m    \"\"\"\u001B[39;00m\n\u001B[1;32m-> 4700\u001B[0m     new_values \u001B[38;5;241m=\u001B[39m \u001B[38;5;28;43mself\u001B[39;49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43m_map_values\u001B[49m\u001B[43m(\u001B[49m\u001B[43marg\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mna_action\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mna_action\u001B[49m\u001B[43m)\u001B[49m\n\u001B[0;32m   4701\u001B[0m     \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m_constructor(new_values, index\u001B[38;5;241m=\u001B[39m\u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39mindex, copy\u001B[38;5;241m=\u001B[39m\u001B[38;5;28;01mFalse\u001B[39;00m)\u001B[38;5;241m.\u001B[39m__finalize__(\n\u001B[0;32m   4702\u001B[0m         \u001B[38;5;28mself\u001B[39m, method\u001B[38;5;241m=\u001B[39m\u001B[38;5;124m\"\u001B[39m\u001B[38;5;124mmap\u001B[39m\u001B[38;5;124m\"\u001B[39m\n\u001B[0;32m   4703\u001B[0m     )\n",
      "File \u001B[1;32mD:\\soft_install\\anaconda\\anaconda3\\envs\\QiXiaoFu\\lib\\site-packages\\pandas\\core\\base.py:921\u001B[0m, in \u001B[0;36mIndexOpsMixin._map_values\u001B[1;34m(self, mapper, na_action, convert)\u001B[0m\n\u001B[0;32m    918\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;28misinstance\u001B[39m(arr, ExtensionArray):\n\u001B[0;32m    919\u001B[0m     \u001B[38;5;28;01mreturn\u001B[39;00m arr\u001B[38;5;241m.\u001B[39mmap(mapper, na_action\u001B[38;5;241m=\u001B[39mna_action)\n\u001B[1;32m--> 921\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[43malgorithms\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mmap_array\u001B[49m\u001B[43m(\u001B[49m\u001B[43marr\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mmapper\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mna_action\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mna_action\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mconvert\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mconvert\u001B[49m\u001B[43m)\u001B[49m\n",
      "File \u001B[1;32mD:\\soft_install\\anaconda\\anaconda3\\envs\\QiXiaoFu\\lib\\site-packages\\pandas\\core\\algorithms.py:1743\u001B[0m, in \u001B[0;36mmap_array\u001B[1;34m(arr, mapper, na_action, convert)\u001B[0m\n\u001B[0;32m   1741\u001B[0m values \u001B[38;5;241m=\u001B[39m arr\u001B[38;5;241m.\u001B[39mastype(\u001B[38;5;28mobject\u001B[39m, copy\u001B[38;5;241m=\u001B[39m\u001B[38;5;28;01mFalse\u001B[39;00m)\n\u001B[0;32m   1742\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m na_action \u001B[38;5;129;01mis\u001B[39;00m \u001B[38;5;28;01mNone\u001B[39;00m:\n\u001B[1;32m-> 1743\u001B[0m     \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[43mlib\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mmap_infer\u001B[49m\u001B[43m(\u001B[49m\u001B[43mvalues\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mmapper\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mconvert\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mconvert\u001B[49m\u001B[43m)\u001B[49m\n\u001B[0;32m   1744\u001B[0m \u001B[38;5;28;01melse\u001B[39;00m:\n\u001B[0;32m   1745\u001B[0m     \u001B[38;5;28;01mreturn\u001B[39;00m lib\u001B[38;5;241m.\u001B[39mmap_infer_mask(\n\u001B[0;32m   1746\u001B[0m         values, mapper, mask\u001B[38;5;241m=\u001B[39misna(values)\u001B[38;5;241m.\u001B[39mview(np\u001B[38;5;241m.\u001B[39muint8), convert\u001B[38;5;241m=\u001B[39mconvert\n\u001B[0;32m   1747\u001B[0m     )\n",
      "File \u001B[1;32mlib.pyx:2972\u001B[0m, in \u001B[0;36mpandas._libs.lib.map_infer\u001B[1;34m()\u001B[0m\n",
      "Cell \u001B[1;32mIn[82], line 1\u001B[0m, in \u001B[0;36m<lambda>\u001B[1;34m(x)\u001B[0m\n\u001B[1;32m----> 1\u001B[0m data\u001B[38;5;241m.\u001B[39mmap(\u001B[38;5;28;01mlambda\u001B[39;00m x :\u001B[43mx\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43msplit\u001B[49m(\u001B[38;5;124m'\u001B[39m\u001B[38;5;124m@\u001B[39m\u001B[38;5;124m'\u001B[39m))\n",
      "\u001B[1;31mAttributeError\u001B[0m: 'float' object has no attribute 'split'"
     ]
    }
   ],
   "source": [
    "data.map(lambda x :x.split('@'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 87,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "a        [dave, qq.com]\n",
       "b    [steve, gmail.com]\n",
       "c      [sam, gmail.com]\n",
       "d                   NaN\n",
       "dtype: object"
      ]
     },
     "execution_count": 87,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 通过.str可以忽略nan的部分\n",
    "data.str.split('@')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 88,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "a        [dave, qq.com]\n",
       "b    [steve, gmail.com]\n",
       "c      [sam, gmail.com]\n",
       "dtype: object"
      ]
     },
     "execution_count": 88,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data[data.notnull()].map(lambda x :x.split('@'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 89,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "a    dave@\n",
       "b    steve\n",
       "c    sam@g\n",
       "d      NaN\n",
       "dtype: object"
      ]
     },
     "execution_count": 89,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.str[:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.20"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
